Approximate NN Queries on Streams with Guaranteed Error/Performance Bounds
Abstract
In data stream applications, data arrive continuously and can be scanned only once, because the query processor has very limited memory relative to the size of the stream. Queries on data streams therefore do not have access to the entire data set, and query answers are typically approximate. Although the k Nearest Neighbors (kNN) problem has been studied extensively in conventional multidimensional databases, those solutions cannot be applied directly to data streams for the above reasons. In this paper, we investigate the kNN problem over data streams. We first introduce the e-approximate kNN (ekNN) problem, which finds approximate kNN answers for a query point Q such that the absolute error of the k-th nearest neighbor distance is bounded by e. To support ekNN queries over streams, we propose a technique called DISC (aDaptive Indexing on Streams by space-filling Curves). DISC can adapt to different data distributions to either (a) optimize memory utilization while answering ekNN queries under given accuracy requirements or (b) achieve the best accuracy under a given memory constraint. At the same time, DISC provides efficient updates and query processing, which are important requirements in data stream applications. Extensive experiments on both synthetic and real data sets confirm the effectiveness and efficiency of DISC.

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004
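The ekNN guarantee stated in the abstract (absolute error of the k-th nearest neighbor distance bounded by e) can be illustrated with a small sketch. This is not the DISC technique itself, whose details are not given here; it is a simplified fixed-grid summary under stated assumptions. The names `GridSketch`, `insert`, `kth_distance`, and `exact_kth_distance` are hypothetical. The idea: snap every arriving point to the centre of a grid cell whose half-diagonal is e, and keep only a counter per occupied cell. Each point moves by at most e, so by the triangle inequality every query distance changes by at most e, and so does the k-th smallest one.

```python
import math
from collections import Counter


class GridSketch:
    """Hypothetical one-pass summary: one counter per occupied grid cell.

    Illustrative only; this is a fixed grid, not the adaptive DISC index.
    """

    def __init__(self, e, dim):
        # Choose the cell side so that half the cell diagonal equals e.
        # Snapping a point to its cell centre then moves it by at most e.
        self.side = 2.0 * e / math.sqrt(dim)
        self.counts = Counter()

    def _cell(self, p):
        # Integer grid coordinates of the cell containing point p.
        return tuple(math.floor(x / self.side) for x in p)

    def insert(self, p):
        # Single scan: the raw point is discarded after this update.
        self.counts[self._cell(p)] += 1

    def kth_distance(self, q, k):
        # Distance from q to each occupied cell centre, weighted by count.
        dists = []
        for cell, n in self.counts.items():
            centre = [(c + 0.5) * self.side for c in cell]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, centre)))
            dists.append((d, n))
        dists.sort()
        seen = 0
        for d, n in dists:
            seen += n
            if seen >= k:
                return d
        raise ValueError("fewer than k points have been inserted")


def exact_kth_distance(points, q, k):
    # Reference answer that needs the full data set (not available on a stream).
    ds = sorted(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p))) for p in points)
    return ds[k - 1]
```

In this sketch the memory used depends on how many cells are occupied, which a fixed grid cannot control. The abstract describes DISC as adapting to the data distribution to trade memory against accuracy; that adaptive behaviour is outside the scope of this example.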
Similar resources
Sketching Streams Through the Net: Distributed Approximate Query Tracking
Emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically distributed streams. Effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality approximate query answers. In...
Continuous Distributed Stream Querying using Sketches
While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying...
Synopsis Construction in Data Streams
Unlike traditional data sets, stream data flow in and out of a computer system continuously and with varying update rates. It may be impossible to store an entire data stream due to its tremendous volume. To discover knowledge or patterns from data streams, it is necessary to develop data stream summarization techniques. Lots of work has been done to summarize the contents of data streams in or...
Efficient Approximation of Correlated Sums on Data Streams
In many applications such as IP network management, data arrives in streams, and queries over those streams need to be processed online using limited storage. Correlated-sum (CS) aggregates are a natural class of queries formed by composing basic aggregates on (x, y) pairs, and are of the form SUM{g(y) : x ≤ f(AGG(x))}, where AGG(x) can be any basic aggregate and f(), g() are user-specified fun...
Enabling epsilon-Approximate Querying in Sensor Networks
Data approximation is a popular means to support energy-efficient query processing in sensor networks. Conventional data approximation methods require users to specify fixed error bounds a priori to address the trade-off between result accuracy and energy efficiency of queries. We argue that this can be infeasible and inefficient when, as in many real-world scenarios, users are unable to determi...
Publication date: 2004